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Speaker: Jose Cores Finotto 


My name is Jose Cores Finotto | work 
with the Infrastructure team at 
GitLab. 


І have been a part of the GitLab 
team since September 2018. 


Background in large organizations 
with extensive experience in 
Infrastructure, especially in 
relational databases. 


Speaker: Nikolay Samokhvalov 
o Database systems: — 
PRI а.о 
o 2002-2005: 


o since 2005: PostgreSQL 


o Worked on XML data type and functions (2005-2007) 


o Long-term community activist - ZRuPostgres, Postgres.tv 


o Conferences Program Committee allhighload Qu PGIBZĘ etc. 
o Current business: zi Postg res ai 


Agenda 


e GitLab 

e Architecture and challenges 
e Performance analysis 

e postgres-checkup 


e joe Bot & Database Lab 


Gitlab Values 


о оо о о о 


Collaboration Results Efficiency Diversity Iteration Transparency 
Work asynchronously Track outcomes, Straightforward Remote-only tends toward Minimum Viable Everything at GitLab 
with fully remote not hours solutions win. global diversity, but we still Change (MVC) if the is public by default: 
workforce (org) Complexity slows have a ways to go. change is better than Strategy, Roadmap, 
Use GitLab to build cycle time. the existing solution, Quarterly Goals, 
GitLab, there's an Issue Hire those who add to culture, ship it. Handbook, and Issue 
and/or Merge Request not those who fit with it. We Trackers 

for everything want cultural diversity 


instead of cultural conformity. 


The open source project 


The open source project 


Used by more than A community of 
organizations code contributors 


We release every month on the 22nd and there is a publicly viewable direction for the product. 
Learn more from our blog ? 


The company 


The company 


GitLab Inc. is an open-core company that sells subscriptions that offer more features and support for GitLab. 


Learn about open core > 


GitLab, the product is a complete DevOps platform, delivered as a single application, fundamentally 
changing the way Development, Security, and Ops teams collaborate. 


Learn more about our product > 


AIL remote with Over Located in 


1297 30 ~ 
"eit million 


nctimatad raeeictararl:i:icare 


countries 


The Company 


2011 2015 


GitLab, the open source project began. We joined Y Combinator and started growing faster. 
Join our team. 


Most of our internal procedures can Our mission is to change all creative 

be found in a publicly viewable « » work from read-only to read-write so 

5000+ page handbook and our that everyone can contribute. This 

objectives are documented in our is part of our overall strategy. 

OKRs. 

Our values are Collaboration, Our Tanuki (Japanese for raccoon 

@) Results, Efficiency, Diversity, dog) logo symbolizes our values 

(9 Inclusion & Belonging , Iteration, with a smart animal that works in a 

and Transparency (CREDIT) and group to achieve a common goal, 

these form an important part of our you can download it on our press 


culture. page. 


Product 


Pricing 


Resources 


Blog 


The definitive guide to remote work Download the playbook 


Support 


Install GitLab 


Explore 


Signin С Get free trial 


GitLab isa 
complete 
DevOps 
platform, 
delivered 
as a single 
application. 
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Discover a more streamliped way to work 
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Create 
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Management 
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Snippets 
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Verify 
Continuous 
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Code Quality 


Code Testing 
and Coverage 


Load Testing 


Web 
Performance 


Usability Testing 


Accessibility 
Testing 


Merge Trains 


Package 


Package 
Registry 


Container 
Registry 


Helm Chart 
Registry 


Dependency 
Proxy 


Release 
Evidence 


Git LFS 


S 


Secure 
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Fuzz Testing 


Dependency 
Scanning 


License 
Compliance 


Secret Detection 


Vulnerability 
Management 
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Release 
Continuous 
Delivery 
Pages 
Review Apps 


Advanced 
Deployments 


Feature Flags 


Release 
Orchestration 


Secrets 
Management 


5576 
Configure 


Auto DevOps 


Kubernetes 
Management 


ChatOps 
Serverless 


Infrastructure as 
Code 


Cluster Cost 
Management 
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Monitor 
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Metrics 
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Product 
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Ргоїесї 
Container 
Scanning 


Container Host 
Security 


Container 


Network 
Security 
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Feature development matrix 


Exponential rate of product improvement ET IDE 
e Incremental 
Deploy 
e Auto DevOps 
e Licence 
Management 
e Tracing 
e Feature 
е Board Flags 
Milestones e Load 
e Deploy Testing 
Boards * Anomaly 
Prometheus Alerts 
Integration ” саты 4 
e Pipelines * Burndown . Aewg Н 
* Container Charts e Framework 
Registry • Canary e App Control 
e Environments Deploy Panel 
e Issue Boards „ссн Quality e Binary 
e Oauth e Сусе • Ашо ОеуОрѕ Repository 
e Group Support Analytics Beta e APM/Tracing 
е GitLab Shell Milestones e GitLab.com  * Time. e Prod 
e CI e Groups e Audit Logs e CI Runner Tracking Ы Kubemetes Monitoring 
• Source Code e Wiki e Side by Side — e Multiple LDAP ө Docker e Review Apps œ Epics e Error Tracking 
e Issues e Labels Diff servers Support * Auto Deploy • SAST e Logging 
2011 2012 2013 2014 2015 2016 2017 2018 


about.gitlab.com/direction 
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GitLab.com in numbers: 


We have a hosted version of Gitlab: 


e Over 40 million daily git pull operations. 

e More than 6k git requests per second. 

e 750.000 git pushes a day. 

e 60k to 80k transactions per second on the database 

e Gdatabasereplicas and 1 primary 

e Database size: 14 TiB 

e Hardware architecture GCP 96 cores with 624 GiB of RAM. 


Current Architecture 


Read-Write 
Async Traffic 


Read-Write 
Sync Traffic 


Read-Only 
Traffic 


Internal Load 
Balancer 


Consul Agent 


Postgresql DB 


Postgresql D 
Primary ostgresql DB 


Secondary 


Postgresql DB 
Secondary 


SS Е 
= = 
Consul Server |=) 


Internal Load теле? 
Application Load Balancing 


Consul Agent 


Postgresql DB 
Secondary 


Database performance peak - 12 of January - 16:06 AM 


The following CPU utilization peak started at 16:05, reaching 8796: 


environment — gprd Host | patroni-06-db-gprd.c.gitlab-production.internal ~ 
CPU usage 


a 
в 
© 
E 


% of all CPUs 
> 
s 
© 
F: 


0% 
15:58 15:59 16:00 16:01 


iowait irq nice softirq steal system user 


Evaluate the analysis report, metrics and queries. If applies, create new issues with the label infradev or datastores to propose new 


improvements to the database cluster overall. 


Performance degradation analysis 


Jose Finotto @Finotto - 1 day ago owe © O 2 


We had the following top 10 statements by total time in execution during this peak: 


Query: 


topk(10, 
sum by (queryid) ( 
rate(pg stat statements seconds totalienv-"gprd", monitor-"db", type-"patroni",instance-"patroni-06-db-gp 
) 


In this analysis, we are considering a 15 minutes interval. 


https://thanos-query.ops.gitlab.net/graph?gO.range іприі-15т%00.епа іприі-2021-01- 
12%2016%3A15&g0.step_input=10&g0.max_source_resolution=0s&g0.expr=topk(10%2C%20%0A%20%20sum%20by%20 (queryid)%2 
0(%0A%20%20%20%20rate(pg_stat_statements_seconds_total%7Benv%3D%22gprd%22%2C%20monitor%3D%22db%22%2C%20typ 
e%3D%22patronin22%2Cinstance%3D%22patroni-06-db-gprd.c.gitlab- 
production.internal%3A9187%22%7D%5B1m%5D)%0A%20%20)%0A) &g0.tab=0 


Performance degradation analysis 


Jose Finotto @Finotto - 1 day ago Owner О 0 2 : 
The outputs аге: 
О Enabie query history 
Load time: 289r 
topk(10, Resolution: 10s 


sum by (queryid) ( Total time series 
rate(pg. stat statements seconds total(env-"gprd", monitor-"db", type="patroni",instance="patroni-O6-db-gprd.c.gitlab-production.internal:9187"}[1m]) 


) 


ec 
- insert metric at cursor - + СЄ; deduplication С partial response 


Graph Console 


— 15m ж « 2021-01-12 16:15 » 10 O stacked [ Only raw data v 


Performance degradation analysis 


O Enable query history 


topk(10, 
sum by (queryid) ( 


rate(pg stat statements seconds total(env-"gprd", monitor="db", typez"patroni"iinstancez"patroni-06-db-gprd.c.gitlab-production.internal:9187") [1m]) 


) 
) 
- insert metric at cursor - * 


Graph | Console 


« 2021-01-12 16:15:00 


Element 
{queryid="3926004648916863976"} 
{queryid="-6386890822646776524"} 
{queryid="7164302182213446947"} 
{queryid="6507699644791286491"} 
{queryid="9095629593792855100"} 
{queryid="-402488551284107289"} 
{queryid="1712385180720443674"} 
{queryid="2298083782068675032"} 
{queryid="-5002940052336095544"} 
{queryid="7366711010424350814"} 


Add Graph 


С deduplication С partial response 


Value 
0.8178287140222235 
0.6909796111596127 
0.5237485621202116 
0.2640517462795186 
0.2503059345329853 
0.23028561521334467 
0.20701351823647401 
0.157706000044224 
0.12475411511170224 
0.12231413964376164 


Load time: 183ms 
Resolution: 10s 
Total time series: 1 


Remove Grap 
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Performance degradation analysis 


2 Jose Finotto @Finotto - 1 day ago 


Queryld 


3926004648916863976 


-6386890822646776524 


7164302182213446947 


6507699644791286491 


Owner © O 2 


Those querylds аге the following SQL statements: 


Query 


SELECT "ci builds"* FROM "ci builds" INNER JOIN "projects" ON "ргојесїѕ"."іа" = "ci builds""project 
сі builds.project id = project features.project id LEFT JOIN (SELECT "ci builds""project id", count() 
"cj builds""type" = $1 AND ("ci builds""status" IN ($2)) AND "ci builds"."runner id" IN (SELECT "ci г 
"сі гиппегѕ“"гиппег_ type" = $3) GROUP BY "ci builds"."project id") AS project builds ON ci builds.p 
("ci builds""status" IN ($4)) AND "ci Бийав"“гиппег id" IS NULL AND "projects""shared runners en 
= $6 AND (project features.builds access level IS NULL or project features.builds access level > %; 
("projects"."visibility level" = $9 OR (EXISTS (WITH RECURSIVE "base and ancestors" AS ((SELECT ' 
(namespaces.id - projects.namespace id)) UNION (SELECT "namespaces"* FROM "namespaces", "I 
"пагпеѕрасеѕ"."іа" = "base and ancestors"."parent id")) SELECT $10 FROM "base and ancestors" А! 
namespace statistics ON namespace statistics.namespace id - namespaces.id WHERE "namespace 
(COALESCE(namespaces.shared runners minutes limit, $11, $12) - $13 OR COALESCE(namespace. 
COALESCE((namespaces.shared runners minutes limit + COALESCE(namespaces.extra shared run 
COALESCE(namespaces.extra shared runners minutes limit, $17)), $18) * $19)))) AND (NOT EXISTS 
"taggings""taggable type" = $21 AND "taggings"."context" = $22 AND (taggable id = ci builds.id) Al 
ORDER BY COALESCE (project builds.running. builds, $25) ASC, ci builds.id ASC 

/application:web, correlation_id:O1TEVX3GF3VGAVE6T YFMR82EJFN/ 


SELECT "users"* FROM "users" INNER JOIN "project_authorizations" ON "users"."id" = "project_auth 
"project_authorizations"."project_id" = $1 /application:web,correlation_id:Lmz5Aaf8Vpa/ 


UPDATE "ci builds" SET "runner. id" = 380987, "status" = 'running', "started at" = '2020-10-29 21:0C 
"updated at" = '2020-10-29 21:00:54.568589', "lock version" = 2 WHERE "ci builds"."id" = 8201577 
/application:web, correlation_id:4zeQHF2IXC9/ 


SELECT SUM((("project statistics"."repository size" "project statistics"."Ifs objects size") - "projec 
INNER JOIN routes rs ON rs.source id = projects.id AND rs.source type = 'Project' INNER JOIN "proji 
"project statistics""project id" = "projects"."id" WHERE (rs.path LIKE 'gitlab-org/?e') AND ("project s 
"project statistics"."Ifs objects size") > "projects""repository size limit" AND "projects"."repository s 
/application:web,controller:merge_requests,action:index, correlation_id:HIfxW7Ir8b1/ 


Ls 


Performance degradation analysis 


Jose Finotto @Finotto - 1 day ago Owe OO 2 


А We had the following top 10 statements by total calls in execution during this peak: 


Query: 


topk(10, 
sum by (queryid) ( 
rate(pg stat statements calls totalienv-"gprd", monitor="db", typez"patroni",instance-"patroni-06-db-gprd 
) 


In this analysis, we are considering a 15 minutes interval. 


https://thanos-query.ops.gitlab.net/graph?g0.range_input=15m&g0.end_input=2021-01- 
12%2016%3A15&g0.step_input=10&g0.moment_input=2021-01- 
08%2014%3A15%3A00&g0.max_source_resolution=0s&g0.expr=topk(10%2C%20%0A%20%20sum%20by%20 (queryid)%20(%0A%20 
%20%20%20rate(pg_stat_statements_calls%7Benv%3D%22gprd%22%2C%20monitor%3D%22db%22%2C%20type%3D%22patroniv2 
2%2Cinstance%3D%22patroni-06-db-gprd.c.gitlab-production.internal%3A9187%22%7D%5B1M%5D)*%0A%20%20)%0A)&g0.tab=0 


Edited by Jose Finotto 1 day ago 


Performance degradation analysis 


Jose Finotto GFinotto - 1 day ago ower) OO 2 : 
The outputs are: 
O Enable query history 
Load time: 283ms 
topk(10, Resolution: 10s 


sum by (queryid) ( Total time series: 17 
rate(pg stat statements calls(env-"gprd", monitor="db", type="patroni",instance="patroni-06-db-gprd.c.gitlab-production.internal:9187"}[1m]) 
) 


) e 
- insert metric at cursor - * © deduplication G partial response 


Graph Console 


= | 15m + « | 2021-01-12 16:15 » 10 O stacked Onlyrawdata 7, 


10k 


Performance degradation analysis 


Those querylds are the following SQL statements: 


Queryld 


833913155023572892 


73367110635711796 


6769309683899657633 


6974950735891200787 


6749620766035719574 


6504150523421693673 


-2372450153195223637 


Query 
SELECT $1 


SELECT "projects"* FROM "projects" WHERE "projects"."id" = $1 LIMIT $2 
/application:web,controller:issues,action:index, correlation_id:tt4UcIFKFU9/ 


SELECT "routes"* FROM "routes" WHERE "routes"."source_id" = $1 AND "routes"."source_type" = 
$2 LIMIT $3 /application:web, controller:issues, action:index, correlation_id:tt4UcIFKFU9/ 


SELECT "namespaces"* FROM "namespaces" WHERE "namespaces"."id" = $1 LIMIT $2 
/application:web,correlation_id:e7d284e6-07ff-4c0e-ae4a-e6880d46b20a/ 


SELECT "taggings"* FROM "taggings" WHERE "taggings"."taggable_id" = $1 AND 
"taggings"."taggable_type" = $2 
Japplication:web,controller:projects,action:show,correlation id:ZiDjjveMIXa/ 


SELECT "tags"* FROM "tags" INNER JOIN "taggings" ON "tags"."id" = "taggings"."tag id" WHERE 
"taggings"."taggable_id" = $1 AND "taggings""taggable type" = $2 AND (taggings.context = $3 
AND taggings.tagger id IS NULL) /application:web,correlation id:dnT2GXhKuX2/ 


SELECT $1 AS one FROM ((SELECT "ci runners"* FROM "ci runners" INNER JOIN 

"ci runner. projects" ON "ci runner. projects""runner id" = "ci_runners"."id" WHERE 

"ci runner projects""project id" = $2) UNION ALL (SELECT "ci runners"* FROM "ci runners" 

INNER JOIN "ci runner namespaces" ON "ci runner namespaces" "runner іа" = "ci гиппегѕ"."іа" 

INNER JOIN "namespaces" ON "namespaces"."id" = "ci runner namespaces""namespace id" 

AND "namespaces""type" = $3 WHERE "namespaces""id" IN (WITH RECURSIVE 

"base and ancestors" AS ((SELECT "namespaces"* FROM "namespaces" INNER JOIN "projects" 

ON "projects"."namespace id" = "namespaces"."id" WHERE "namespaces""type" = $4 AND 

"projects""id" = $5) UNION (SELECT "namespaces"* FROM "namespaces", "base and ancestors" 21 


БЕРЕ еее Ра PW RE NT ла 


Postgres-checkup 


Nikolay апа his team develop роѕідгеѕ-сһескир 
(https://gitlab.com/postgres-ai/postgres-checkup) -- a tool for 
automated  health-checks of Postgres databases, that 
contains: 


e 28 reports, checking various aspects of Postgres production database health and 
performing detailed SQL workload analysis. 

e Reports contain 3 detailed parts: observations, conclusions, апа 
recommendations. 

e Very lightweight checks, unobtrusive activities working well under heavy load, in 
large databases. Does not require any setup on the servers. 


e Multi-node analysis: the master is checked together with its replicas. 


postgres-checkup 


e Weekly tech audit reports that augment the existing monitoring (prometheus, postgres exporter, 
grafana, thanos): 


o track Postgres and components versions 

o track settings and setting deviations 

o  bloat control (tables, indexes) 

o index health (invalid, unused, redundant, etc) 
o deep query analysis 

o object sizes 


o  int4 PKs 


o .. апа more 


H002 Unused Indexes 


Observations 


Data collected: 2021-01-11 13:40:38 +0000 UTC 
Current database: gitlabhq production 


Stats reset: 6 mons 27 days 14:26:00 ago (2020-06-13 23:13:01 +0000 UTC) 
Never Used Indexes 


The list is limited to 50 items. Total: 178. 


10.220.16.106 10.220.16.101 10.220.16.102 10.220.16.103 10.220.16.104 10.220.16.105 10.220.16.107 10.220.16.108 Y Index 


= usage usage usage usage usage usage usage usage size 


Table size 


OTAL===== 165.66 GiB 7.27 TiB 
ci_builds index_ci_builds_on_protected 45.55 GiB 0.92 ТІВ 
сі builds index ci builds on user id and created at and type eq ci build 30.01 GiB 0.92 ТІВ 
ci_builds index ci builds on queued at 22.23 GiB 0.92 ТІВ 
merge_request_diffs index_merge_request_diffs_on_external_diff_store 8.41 GiB 27.54 GiB 
projects index_projects_on_runners_token 3.90 GiB 4.77 GiB 
projects index_projects_on_mirror_last_successful_update_at 3.90 GiB 4.77 GiB 
projects index_projects_on_last_repository_check_failed 3.89 GiB 4.77 GiB 
projects index_projects_on_pending_delete 3.89 GiB 4.77 GiB 
users index_users_on_accepted_term_id 2.07 GiB 3.55 GiB 
ci_runners index_ci_runners_on_is_shared 2.03 GiB 337.54 MiB 
merge_request_metrics index_mr_metrics_on_target_project_id_merged_at_time_to_merge 2.00 GiB 6.06 GiB 
notes note_mentions_temp_index 1.75 GiB 299.19 GiB 
namespaces index_namespaces_on_shared_and_extra_runners_minutes_limit 1.27 GiB 2.56 GiB 


namespaces index_namespaces_on_ldap_sync_last_update_at 1.22 GiB 2.56 GiB 


K003 Top-50 Queries by total time 


Observations 


Data collected: 2021-01-11 13:40:41 +0000 UTC 
Current database: gitlabhq production 


Master (10.220.16.106) 


Start: 2021-01-11T13:05:57.091968+00:00 
End: 2021-01-11T13:39:07.728772+00:00 
Period seconds: 1990.6368 

Period age: 00:33:10.636804 


Error (calls): 0.00 (0.00%) 
Error (total time): 0.00 (0.0096) 


The list is limited to 50 items. 


(qu ie id) Query Y Total time shared blks hit shared blks read shared blks dirtied 
SELECT "users".* FROM "users" INNER 
JOIN "project authorizations" ON 1,140,899.14 ms 7,371,979 33,889,906 blks 816,616 blks 4,870 blks 
1 "users"."id" = "project authorizations"."user id" 573.133 ms/sec 3.71K/sec 17.03K blks/sec 410.23 blks/sec 2.45 blks/sec 
(-6386890822646777000) WHERE "project authorizations"."project id" 15.679 ms/call 101.31/call 465.73 blks/call 11.22 blks/call 0.07 biks/call 
= $1 /"application:web,correlation id:Lmz5Aaf8Vpa"/ 15.5296 4.57% 2.86% 9.36% 0.20% 
Full query 


WITH RECURSIVE "namespaces cte" 

AS ((SELECT "namespaces"."id", 
"members"."access level" FROM "namespaces" 
INNER JOIN "members" ON "namespaces" "id" 
= “members"."source_id" WHERE "members"."type" 
= $1 AND "members"."source type" 

= $2 AND "namespaces" "type" = 

$3 AND "members"."user id" = 54 

AND "members"."requested at" IS 

NULL AND (access level >- $5)) 

UNION 


( 

SELECT "namespaces" "id", LEAST( 

"members"."access level", "group. group links"."group. access") 

AS access level FROM "namespaces" 995,280.30 ms 89,469,596 504,881,421 blks 33,166 blks 954 blks 
2 INNER JOIN "group. group. links" 499.981 ms/sec 44.95K/sec 253.63К blks/sec 16.66 blks/sec 0.48 blks/sec 
(-7232084447659837000) ON "group group links"."shared group id" 24.180 ms/call 2.18K/call 12.27K blks/call 0.81 biks/call 0.02 biks/call 

= "namespaces" "id" INNER JOIN 13.54% 55.44% 42.55% 0.38% 0.04% 

"members" ON "group group links"."shared with group id" 

= "members"."source id" AND "members"."source type" 

= $6 AND "members"."requested at" 

IS NULL AND "members"."user id" 

= $7 AND "members"."access level" 

» $8 WHERE "namespaces" "type" 

= 59) 

UNION 

(SELECT "nai 5"." 


м "нд 


K002 Workload Type ("The First Word" Analysis) 


Observations 


Data collected: 2021-01-11 13:40:41 +0000 UTC 
Current database: gitlabhg production 


Master (10.220.16.106) 


Start: 2021-01-11713:05:57.091968+00:00 
End: 2021-01-11713:39:07.728772+00:00 
Period seconds: 1990.6368 

Period age: 00:33:10.636804 


Error (calls): 0.00 (0.0096) 
Error (total time): 0.00 (0.0096) 


Workload 


# Әуре 


Calls Y Total time 


41,827,896 5,019,032.35 ms 
21.02K/sec 2521.320 ms/sec 
1.00/call 0.120 ms/call 
94.08% 68.28% 


752,897 1,066,397.37 ms 
378.22/sec 535.707 ms/sec 
1.00/call 1.416 ms/call 
1.69% 14.51% 


999,462 755,406.99 ms 
502.08/sec 379.480 ms/sec 
1.00/call 0.756 ms/call 
2.25% 10.28% 


update 


837,581 502,066.21 ms 
420.76/sec 252.214 ms/sec 
1.00/call 0.599 ms/call 
1.88% 6.83% 


select ... 40,689 7,361.69 ms 
for [no 20.44/sec 3.698 ms/sec 
key] 1.00/call 0.181 ms/call 
update 0.09% 0.10% 


Replica servers: 
Replica (10.220.16.101) 


Start: 2021-01-11713:05:51.048781+00:00 
End: 2021-01-11713:36:10.229216+00:00 
Period seconds: 1819.18044 

Period age: 00:30:19.180435 


# Workload Calls Y Total time 


type 


20,300,433 9,206,677.68 ms 
11.16K/sec 5060.893 ms/sec 


Rows 


68,804,876 
34.57К/вес 
1.64/call 
42.63% 


90,927,447 
45.68К/вес 
120.77/call 
56.34% 


741,537 
372.51/sec 
0.74/call 
0.46% 


873,121 
438.61/sec 
1.04/call 
0.54% 


40,689 
20.44/sec 
1.00/call 
0.03% 


Rows 


29,084,119 
15.99K/sec 


shared blks hit 


604,189,040 blks 
303.52K blks/sec 
14.44 blks/call 
50.9296 


512,703,313 blks 
257.56K blks/sec 
680.97 blks/call 
43.2196 


48,497,397 blks 
24.37K blks/sec 
48.52 biks/call 
4.09% 


21,046,160 blks 
10.58K blks/sec 
25.13 biks/call 
1.77% 


207,484 blks 
104.23 blks/sec 
5.10 blks/call 
0.02% 


shared blks hit 


4,829,058,098 blks 


2.66M blks/sec 


shared blks read shared blks dirtied 


6,911,927 blks 95,009 blks 
3.48K blks/sec 47.73 blks/sec 
0.17 blks/call 0.00 blks/call 
79.2196 3.81% 


33,166 biks 954 blks 
16.66 blks/sec 0.48 blks/sec 
0.04 blks/call 0.00 biks/call 
0.38% 0.04% 


1,211,149 blks 1,681,451 blks 
608.42 blks/sec 844.68 blks/sec 
1.21 blks/call 1.68 biks/call 
13.88% 67.36% 


561,911 blks 711,983 blks 
282.28 blks/sec 357.67 blks/sec 
0.67 biks/call 0.85 blks/call 
6.44% 28.52% 


7,997 blks 6,751 blks 
4.02 blks/sec 3.39 blks/sec 
0.20 blks/call 0.17 blks/call 
0.09% 0.27% 


shared blks read shared blks dirtied 


12,903,398 blks 0 biks 
7.10К biks/sec 0.00 blks/sec 


shared blks written 


3,913 blks 
1.97 blks/sec 
0.00 blks/call 
82.55% 


1 biks 
0.00 blks/sec 
0.00 blks/call 
0.02% 


533 blks 
0.27 blks/sec 
0.00 blks/call 
11.2496 


289 blks 
0.15 blks/sec 
0.00 blks/call 
6.10% 


4 biks 
0.00 blks/sec 
0.00 blks/call 
0.08% 


shared blks written 


401,782 blks 
220.86 blks/sec 


ЫК read time 


1,373,499.10 ms 
689.980 ms/sec 
0.033 ms/call 
73.2296 


6,402.21 ms 
3.216 ms/sec 
0.009 ms/call 
0.34% 


213,446.15 ms 
107.225 ms/sec 
0.214 ms/call 
11.38% 


282,057.24 ms 
141.692 ms/sec 
0.337 ms/call 
15.04% 


361.00 ms 
0.181 ms/sec 
0.009 ms/call 
0.02% 


blk_read_time 


866,803.16 ms 
476.480 ms/sec 


blk_write_time 


92.86 ms 
0.047 ms/sec 
0.000 ms/call 
81.72% 


0.04 ms 
0.000 ms/sec 
0.000 ms/call 
0.04% 


13.47 ms 
0.007 ms/sec 
0.000 ms/call 
11.85% 


7.18 ms 
0.004 ms/sec 
0.000 ms/call 
6.32% 


0.08 ms 
0.000 ms/sec 
0.000 ms/call 
0.07% 


bik_write_time 


10,641.69 ms 
5.850 ms/sec 


K001 Globally Aggregated Query Metrics 


Observations 


Data collected: 2021-01-11 13:40:41 +0000 UTC 
Current database: gitlabhq production 


Master (10.220.16.106) 


Start: 2021-01-11T13:05:57.091968+00:00 
End: 2021-01-11T13:39:07.728772+00:00 
Period seconds: 1990.6368 

Period age: 00:33:10.636804 


Error (calls): 0.00 (0.00%) 
Error (total time): 0.00 (0.00%) 


Calls Total time Rows shared blks hit 


44,458,525 
22.34K/sec 
1.00/call 
100.00% 


7,350,264.61 ms 
3692.419 ms/sec 
0.165 ms/call 
100.00% 


161,387,670 
81.08K/sec 
3.63/call 
100.00% 


1,186,643,394 blks 
596.12K blks/sec 
26.69 blks/call 
100.0096 


Replica servers: 
Replica (10.220.16.101) 


Start: 2021-01-11713:05:51.048781+00:00 
End: 2021-01-11713:36:10.229216+00:00 
Period seconds: 1819.18044 

Period age: 00:30:19.180435 


Calls Total time Rows shared blks hit 


21,518,698 
11.83K/sec 
1.00/call 
100.0096 


9,393,835.91 ms 
5163.774 ms/sec 
0.437 ms/call 
100.00% 


31,311,433 
17.22K/sec 
1.46/call 
100.00% 


4,887,880,546 biks 
2.69M blks/sec 
227.15 blks/call 
100.0096 


Replica (10.220.16.102) 


Start: 2021-01-11T13:05:52.314852+00:00 
End: 2021-01-11713:36:39.061152+00:00 
Period seconds: 1846.7463 

Period age: 00:30:46.7463 


Calls Total time Rows shared blks hit 


23,905,903 
12.95K/sec 


9,934,511.91 ms 
5379.468 ms/sec 


32,988,938 
17.87K/sec 


5,015,880,782 blks 
2.72M blks/sec 


shared blks read 


8,726,150 blks 
4.39K blks/sec 
0.20 blks/call 
100.0096 


shared blks read 


12,921,880 blks 
7.11K blks/sec 
0.60 blks/call 
100.00% 


shared blks read 


13,968,380 blks 
7.57K blks/sec 


shared blks dirtied 


2,496,148 blks 
1.26K blks/sec 
0.06 blks/call 
100.00% 


shared blks dirtied 


0 biks 
0.00 blks/sec 
0.00 biks/call 
0.00% 


shared blks dirtied 


0 biks 
0.00 blks/sec 


shared blks written 


4,740 blks 
2.38 blks/sec 
0.00 blks/call 
100.0096 


shared blks written 


402,350 blks 
221.17 blks/sec 
0.02 blks/call 
100.00% 


shared blks written 


514,997 blks 
278.87 blks/sec 


Ык read time 


1,875,765.69 ms 
942.294 ms/sec 
0.042 ms/call 
100.0096 


Ык read time 


868,082.59 ms 
477.183 ms/sec 
0.040 ms/call 
100.00% 


blk read time 


758,241.41 ms 
410.582 ms/sec 


Si Postgres.ail 


- clone DB of any size in a few seconds in bring them in any 
point of the DevOps lifecycle 


- automated (in CI) testing of DB migrations 
- guess-free SQL optimization 
- instant deployment of full-size staging apps 


Ag) Gitlab снемусом | ПігО NUTAND<. 
Сом СК =B UNGRES 


Non-production environment weaknesses are reasons of multiple development problems 


Development bottlenecks Frictionless development 
(with standard staging DB) (with Database Lab) 


X Bugs: difficult to reproduce, easy to miss € Bugs: easy to reproduce, and fix early 

X Not 100% of changes are well-verified Ф 100% of changes are well-verified 

X SQL optimization is hard Ф SQL optimization сап be done by anyone 
X Each non-prod big DB costs a lot Ф Non-prod DB refresh takes seconds 


X Non-prod DB refresh takes hours, days, weeks Ф Extra non-prod DBs doesn't cost a penny 


Database experiments - traditional approach 


e чт == am am em - ы. м т 
= eee 


м 


“ә шыш шыш шыш шыш шыш шыш / шы» 


ee A чт шан шы шш шш шз ма т 
к eee ee 


мч 


wee шыш шшш шыш шыш шыш шыш шә 


Production 


0): 


Database experiments on thin clones 


Production 


Thin clones - copy-on-write 


$8 Thick copy of production (any size] 
Ө Thin clone (size starts from 1 MB, depends on changes] 


C2 


№ 


Database experiments on thin clones - yes and no 


Yes 


- Check execution plan — Joe bot 
- EXPLAIN w/o execution 
- EXPLAIN (ANALYZE, BUFFERS) 

- (timing is different; structure and 
buffer numbers - the same) 

- Check DDL 
- index ideas (Joe bot) 
- auto-check DB migrations 

- Heavy, long queries: analytics, dump/restore 
- No penalties! (think 

hot_standby_feedback, locks, CPU) 


ғ ( 


Мо 


- Load testing 
- Regular HA/DR goals 
- backups 


- (but useful to check 
WAL stream, recover 
records by mistake) 
- hot standby 


(but useful to offload very 
long-running SELECTs) 


Database Lab - Open-core model 


42489 
еее 
еее 


Database Lab Engine Platform 

Open-source (AGPLv3) SaaS (pricing model: $ per TiB) 
- Thin cloning - Web console (GUI) 

- Automated provisioning and data refresh - Access control, audit 

- Data transformation, anonymization - History, visualization 


- Supports managed Postgres (AWS RDS, etc.) - Support 


https://gitlab.com/postares-ai/database-lab https://postgres.ai/ 


— follow the links and start using it for your databases 


SQL optimization using Database Lab and Joe bot 


GitLab Infrastructure (GCP) 


GitLab 
Developers 
& DBREs Joe 

Combined Instance 


Cloud Public 


4 EN. 1 IP Address | Өз» — $ Joe -- © Postgres 


using Slack 
к Control Instance ZFS 
Persistent Disk Persistent Disk 


35 


Automated checks of database migrations (DDL) using full-size thin clones provided by Database Lab 


Before Database Lab: 


- Developers test DDL on tiny databases, using only synthetic data, not seeing real behavior 
Before each release, DDL is tested on staging - a reduced/old/modified data set (~5-10% of real size) 


- Manual code review. Very rarely the change is tested on a production clone 


GitLab ‘Next Projects v Groups v Моге v 


w GitLab.com % GitLab Infrastructure Team > production > Issues > #2802 Issues with deploying D B migrations 
Opened 3 months ago by 4 ops-gitlab-net Owner were not uncommon 


2020-10-07: Postdeploy migration failure due to statement timeout 


An example: 


Summary 
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2802 


@postdeploy migration failure due to statement timeout 
Marking this as high severity since it is a deployment blocker 
https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/jobs/2028901 


/opt/gitlab/embedded/service/gitlab-rails/db/post migrate/20200914185610 schedule sync blocking issues, count.rb 


Timeline 


Automated checks of database migrations (DDL) using full-size thin clones provided by Database Lab 


fS gitlab-org/database-team/gitlab-com-migrations Maintainer ОО: 


With Database Lab: 557 (project 278964 bot - 1 week ago 


Database migrations 


Migrations included in this change have been executed on gitlab.com data for testing 
purposes. 


- Separate project 
- security: limited access, firewall 
. $ 20200716234259 237.95 
- isolation: reduced codebase and no extra components vou x 
z connected to DLE API, able to use dblab clone 20201230161206 06 


20210101110640 0.9s 


Migration Total runtime 


20210102164121 7.35 


- Оп апу СІ build in the main project (“gitlab”) has DDL, then: 
- aCl build in this special project is triggered 
- DDLis auto-verified on a fresh clone (lag «6h) provided by DLE 
- detailed artifacts are available to the Database Team and Infrastructure 


- Output - Postgres logs 
= posta - pgsa sampling 
- production timing estimates - summary 


- summary is automatically posted as an MR comment 


SQL Optimization chatbot (“Joe bot") 


Postgres.ai Console В 


Organization Switch © This is a Demo organization, once you've explored Database Lab features: Create new organization 


Urganizauons / Demo / SUL Uptimization / Joe instances / Instance #26 
Demo 


Ask Joe 
Dashboard 


Project and database 


demo. maindb 125GB Clear chat 
Database Lab 


Instances You 2021-05-17 11:34:21 
Observed sessions explain select from pgbench accounts where aid « 10000; 


SQL Optimization 


e Joe Bot 2021-05-17 11:34:22 
` 
= 


explain select from pgbench_accounts where aid < 10000; 
Ask Joe P < = : 
Session: webui-i4049 


History 
Plan with execution: 


Checkup Index Only Scan using pgbench accounts pkey2 on public.pgbench accounts (cost=0.57..3342.79 rows=9289 width-0) (actual time-0.028..364.288 


rows-9999 loops-1) 
Reports Index Cond: (pgbench accounts.aid « 10000) 
Heap Fetches: 10144 
Buffers: shared hit-945 read-738 dirtied-627 
Settings I/0 Timings: read-290.933 


General 
Members Full execution plan ~ 


Access tokens Other artifacts are provided below 


Billing Recommendations: 

! Query processes too much data to return a relatively small number of rows. — Reduce data cardinality as early as possible during the execution, using one or several of the following 
techniques: new indexes, partitioning, query rewriting, denormalization. See the visualization of the plan to understand which plan nodes are the main bottlenecks. Show details 

T VACUUM ANALYZE needed - Visibility map(s) for the table(s) involved in the query are outdated. For better performance: 1) run VACUUM ANALYZE on them as a one-time solution, 2) 
perform autovacuum tuning as a long-term permanent solution (tuning "vacuum" part of autovacuum). Show details 


Audit 


Profiling of wait events: 


[ Command 


Documentation 


Ask support 2021 © Postgres.ai Documentation | News | Terms of Service | Privacy Policy | Ask support 


SQL Optimization chatbot ("Joe bot") - summary for a single query 


Profiling of wait events: 


seconds wait event 


0.256244 IO.DataFileRead 
0.052938 Running 
0.052630 IO.SLRURead 


0.361812 


Time: 366.157 ms 
- planning: 1.104 ms 
- execution: 365.053 ms (estimated* for prod: 0.053...0.310 s) 
- І/0 read: 290.933 ms 
- I/0 write: N/A 


Shared buffers: 
- hits: 945 (47.40 MiB) from the buffer pool 
- reads: 738 (45.80 MiB) from the OS file cache, including disk 1/0 
- dirtied: 627 («4.90 MiB) 
- writes: 0 


SQL Optimization chatbot ("Joe bot") - History & Visualization 


Postgres.ai Console В Nikolay (0) (> 


© This is a Demo organization, once you've explored Database Lab features: Create new organization 


Organization Switch 


Demo Organizations / Demo / demo / SQL Optimization / History / Command #14032 


82 Dashboard Command %14032 (explain) from session #4049 


Әп Database Lab Explain Depesz Explain PEV2 Explain FlameGraph 


Instances 


Author: 


Flame Graph (buffers): 


Observed sessions 
Nikolay Samokhvalov 


Æ SQL Optimization 


Command: esci: Bg 
Ask Joe 25 


Flame Graph (timing): 


explain : z T 


History 


GATHER MERGE aims |3 


Ye Checkup Query: 


Reports select from pgbench_accounts where aid < 10000; 


Settings 
Plan: 


HASH JOIN тзатз | 6 HASH Q 


SEQSCAN Ams 1 


General 


with execution with execution (JSON) w/o execution 
Members 


Access tokens 


Rillina Index Only Scan using pgbench_accounts_pkey2 on public.pgbench_accounts (cost=0.57..3342.79 rows=9289 (ac 
Index Cond: (pgbench_accounts.aid < 10000) 
Heap Fetches: 10144 

Buffers: shared hit=945 read=738 dirtied=627 


Ш Documentation 


Database Lab "Observed sessions” 


Postgres.ai Console B Nikolay 


Organization Switch Organizations / Demo / Observed sessions / Database Lab observed session #34 


Demo 
Database Lab observed session #34 


88 Dashboard 
Summary 


Database Lab Status: 
Instances Session: #34 
Obeaved E Project: demo 

DLE instance: #35 

Duration: 2m, 55 
Created: 2 months ago 
Ask Joe Branch: transform 


5 Commit: 34e1264a823825f37aa7 8a7d6878c029f3e29301 
History Triggered by: Anatoly 


PR/MR: https://gitlab.com/postgres-ai/ci-example/-/merge requests/2 


SQL Optimization 


Checkup 
Checklist 


Reports 
pcm Dangerous locks is not observed during the session 
ai 
š (13 intervals with locks of 10 allowed) 
Settings 


General 


Session duration is within allowed interval 
(2m, 5s of 1h allowed) 


Members 


Observed intervals and details 
Access tokens 


Billing Hide intervals ^ 


Audit Started at Duration 
М 2020-11-03 14:58:19 UTC 10s 
А 2020-11-03 14:58:29 UTC 10s 


i"datname":"test small","relation":"141023","transactionid":null,"mode":"AccessExclusivelock" ," Locktype":"relation","granted":true,"usename":"ci user","query":"create table 11 as select i, 
:text as payload from generate series(1, 100000000) query start":"2020-11- 
8:26.655409400:00","state":"active","wait event type":"IO","wait event":"WALInitWrite","xact start":"2020-11-03T14:58:26.614131«00:00","xact duration":"00:00:13.236948","query start":"2020- 
11-03T14:58:26.655409400:00" "query duration":"00:00:13.19567","state change":"2020-11-03T14:58:26.655413400:00","state changed ago":"00:00:13.195666" ,"pid":45) 
test small","relation":"141028","transactionid":null,"mode":"AccessExclusiveLock" ," Locktype":"relation","granted":true,"usename":"ci user","query":"create table t1 as select i, 
ext as payload from generate series(1, 100000000 ,"query, start":"2020-11- 
26.655409«00:00" ,"state":"active","wait event type" , іс "WALInitWrite","xact start":"2020-11-03T14:58:26.614131«00:00" ,"xact duration":"00:00:13.23691","query start":"2020- 
11-03T14:58:26.655409400:00" ,"query duration":"00:00:13.19563. x '2020-11-03T14:58:26.655413400:00","state changed адо":"00:00:13.195628","рій":45) 


А 2020-11-03 14:58:39 UTC 10s 


141023", "transactionid":null,"mode":"AccessExclusiveLock" ," Locktype":"relation","granted":true, "usename "ci user","query":"create table t1 as select i, 
ext as payload from generate series(1, 100000000) i;","query start":"2020-11-03T14:58:26.655409400:00","state":"active","wait event type":null,"wait event":null,"xact start":"2020-11- 
" 1"2020-11-03T14:58:26.655409«400: 00" , "query, duration":"00:00:23.196195","state change":"2020-11- 


"test small","relation":"141028","transactionid" :null,"mode" :"AccessExclusivelock" ," Locktype":" relation","granted":true,"usename":"ci user","query":"create table t1 as select i, 
:text as payload from generate series(1, 100000000) i;","query start":"2020-11-03T14:58:26.655409400:0 :"active","wait event type":null,"wait event":null,"xact start":"2020-11- 


500: 00" ," 


oummary 


- PostgreSQL database health check is automated 
- All engineers now can do the following without delays: 
- get EXPLAIN (ANALYZE, BUFFERS) for any query for full-size DBs 
(not being blocked and not blocking others) 
- getinsights of how DDL behaves before submitting MR for DB migration review 
- learn SQL by example (using full-size databases!) 
- Database team has 
- Way to conduct various database experiments without need to provision new nodes 
and/or wait for long data refresh 
- DB migration reviews are pre-checked automatically in 100% of cases, with prediction 


of what would happen during production deployment 


Contribute 


GitLab 


Everyone can contribute 


Thank you. Please feel free to follow up! 


Nikolay Samokhvalov 


nik@postgres.al 
Twitter: @samokhvalov 


LinkedIn: linkedin.com/in/samokhvalov/ 


Jose Finotto 


ifinotto@gitlab.com 
LinkedIn: linkedin.com/in/jose-c-bb4a21 78/ 


